Currently, only single-page TIFFs are supported. If you have multi-page
TIFFs, consider splitting them into individual images.
Note that this feature works best for computer-rendered text rather than
handwriting.
The underlying technology is Tesseract.js.
PDFs
Currently only text extraction is supported and not OCR.
This means that the PDF needs to have proper text information in it (i.e.
the text can be selected in a PDF viewer), whereas scanned documents are
not yet supported.
There are plans to apply the same OCR-based recognition used for images to
PDFs, but this is not yet implemented.
Office documents
The text will be extracted from the following file formats:
Microsoft Word documents
Microsoft Excel documents (only the raw text information, the cell structure
is not maintained).
Microsoft PowerPoint documents
The OpenDocument alternatives to the previous formats (Text, Spreadsheet,
Presentation), created by editors such as LibreOffice and OpenOffice.
Configuring and triggering OCR
The OCR can be configured by going to Options →
Media and looking for the Text Extraction (OCR) section.
There are three ways to trigger the OCR:
By enabling Auto-process new files, which processes only the notes or
attachments created after the option is enabled; existing files remain
unprocessed.
By pressing Start Batch Processing, which processes all existing notes.
By manually requesting text extraction for an image or file, regardless of
whether automatic processing is enabled.
Minimum confidence
Each text extraction from an image comes with a confidence level, which
indicates how reliable the extracted text appears to be.
When the minimum confidence is set to a low percentage, the text extraction
can interpret symbols and drawings incorrectly, resulting in garbled text.
If the confidence of the text extracted for a note or an attachment is lower
than the minimum confidence, the OCR result is disregarded.
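The rule above can be illustrated with a minimal sketch. The `OcrResult` shape and the `applyMinimumConfidence` function are hypothetical names chosen for this example and are not taken from the application's code; the sketch only assumes that confidence and the threshold are both expressed as percentages.

```typescript
// Hypothetical shape of an OCR result; the field names are assumptions.
interface OcrResult {
    text: string;
    confidence: number; // percentage, 0 to 100
}

// Keeps the extracted text only when its confidence reaches the configured
// minimum; otherwise the result is disregarded (represented here as null).
function applyMinimumConfidence(result: OcrResult, minConfidence: number): string | null {
    return result.confidence >= minConfidence ? result.text : null;
}
```

For example, with a minimum confidence of 75, a result with confidence 92 is kept, while one with confidence 40 is dropped. Lowering the threshold lets more low-quality extractions through, which is where garbled text tends to come from.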
Language management
OCR needs to know the language of the content in order to work correctly.
Each language has its own data, which needs to be downloaded, and accents or
other language-specific symbols are not recognized when only the default
language is used.
To configure the languages that are supported by the OCR, simply go to
Options → Language & Region and
adjust the Content languages.
When there are no content languages defined, the user interface Language is
used instead.
After making this change, the automatic processing or manual reprocessing
will take into consideration the new languages.
To enforce the detection in a particular language for a given note, use
the language attribute,
similar to text content language.
For Attachments,
it's not possible to manually adjust the language.
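The language selection described in this section can be summarized as a small sketch. The function `resolveOcrLanguages` and its parameters are hypothetical and only mirror the order described above (per-note language, then content languages, then the user interface language); the actual implementation may differ.

```typescript
// Illustrative only: picks the languages OCR would use for a note.
// noteLanguage is the value of the note's language attribute, if any.
function resolveOcrLanguages(
    noteLanguage: string | undefined,
    contentLanguages: string[],
    uiLanguage: string
): string[] {
    // A language set on the note itself enforces detection in that language.
    if (noteLanguage) {
        return [noteLanguage];
    }
    // Otherwise, use the configured content languages.
    if (contentLanguages.length > 0) {
        return contentLanguages;
    }
    // With no content languages defined, fall back to the UI language.
    return [uiLanguage];
}
```

For attachments, `noteLanguage` would always be `undefined` in this sketch, since their language cannot be adjusted manually.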
Viewing extracted content for a single note
To access the extracted content of a note:
For File notes,
go to the Note buttons → Advanced → View OCR Text.
For Attachments (e.g.
Images in Text notes),
double-click the attachment to view the details, press the […] button at
the left and select View extracted text (OCR).
This section allows:
Viewing the extracted text, which can be copied elsewhere if needed or
just to check the quality of the extraction.
If the text of the note has not been extracted yet, pressing Process OCR
will process it in the background. If the extraction confidence is lower
than the minimum confidence, there will be a notification.
Similarly, if the minimum confidence has been changed in the settings,
pressing the Process OCR button again re-extracts the text.